Biological Pattern Discovery with R Machine Learning Approaches (Zheng Rong Yang)

sequencing reads hitting a gene of a replicate, i.e., the sequencing count of a replicate for one gene. Table 6.4 shows such a count matrix used to study how genes contribute to the airway smooth muscle cytokine function [Himes, et al., 2014], where there were two experimental conditions and each condition had four replicates. The objective of analysing such a sequencing count matrix was to find out which subset of genes was differentially expressed across the two conditions.

Table 6.4. A count matrix after the sequencing reads have been mapped to a reference genome. Each count represents the number of times the sequencing reads hit a gene. Gene IDs were shortened by removing the prefix ENSG00000000 and sample IDs were shortened by removing the prefix SRR10395. This means the full ID of gene 003 was ENSG00000000003 and the full ID of sample 08 was SRR1039508.

723   486   904   445   1170   1097   806   604
467   523   616   371    582    781   417   509
347   258   364   237    318    447   330   324
118   102
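The ID shortening described in the caption of Table 6.4 can be reversed with a few lines of R. The sketch below also places the first 24 values listed under Table 6.4 into a matrix, read as three genes by eight replicates. That layout is an assumption based on the two conditions with four replicates each. The helper names `full_gene_id` and `full_sample_id`, and the row and column names `gene1`..`gene3` and `sample1`..`sample8`, are placeholders, not the labels used in the table.

```r
# Prefixes removed from the IDs, as stated in the caption of Table 6.4.
gene_prefix   <- "ENSG00000000"
sample_prefix <- "SRR10395"

# Hypothetical helpers that restore a full ID from a shortened one.
full_gene_id   <- function(short) paste0(gene_prefix, short)
full_sample_id <- function(short) paste0(sample_prefix, short)

full_gene_id("003")   # "ENSG00000000003"
full_sample_id("08")  # "SRR1039508"

# First 24 counts listed under Table 6.4, assumed to be 3 genes x 8 replicates.
# Row and column names are placeholders, not the real gene or sample IDs.
counts <- matrix(
  c(723, 486, 904, 445, 1170, 1097, 806, 604,
    467, 523, 616, 371,  582,  781, 417, 509,
    347, 258, 364, 237,  318,  447, 330, 324),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("sample", 1:8))
)
storage.mode(counts) <- "integer"  # counts are non-negative integers

# Total count per replicate over these rows.
colSums(counts)
```

Keeping the matrix in integer storage mirrors the nature of sequencing counts and is what count-based packages generally expect as input.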

Sequencing count data are different from microarray data as the data are non-negative integers. Therefore, limma may not be very suitable. The negative binomial distribution has been employed for discovering DEGs for sequencing count data [Robinson, et al., 2009; Anders and Huber, 2010].
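One commonly cited reason for preferring the negative binomial distribution for counts is that its variance can exceed its mean, whereas a Poisson variance equals its mean. The simulation below is a minimal sketch of that property only. The mean `mu` and the parameter `size` are hypothetical values chosen for illustration and are not estimated from Table 6.4.

```r
set.seed(1)

mu   <- 500   # hypothetical mean count for one gene
size <- 10    # hypothetical negative binomial size (inverse dispersion)
n    <- 10000 # number of simulated counts

# Poisson counts: the variance equals the mean.
pois <- rpois(n, lambda = mu)

# Negative binomial counts with the same mean: variance is mu + mu^2 / size.
nb <- rnbinom(n, mu = mu, size = size)

# Both simulated data sets consist of non-negative integers.
stopifnot(all(pois >= 0), all(nb >= 0))

# Compare the empirical mean and variance of the two simulations.
rbind(
  poisson           = c(mean = mean(pois), var = var(pois)),
  negative_binomial = c(mean = mean(nb),   var = var(nb))
)

# Theoretical negative binomial variance under these assumed parameters.
theoretical_var <- mu + mu^2 / size  # 500 + 25000 = 25500
```

In this parameterisation a smaller `size` gives a larger variance for the same mean, which is how the extra variability between replicates is modelled.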

Discover DEGs for sequencing count data using DESeq2

DESeq2 is a package developed for gene differential expression pattern discovery based on a sequencing count data set [Love, et al., 2014]. The first thing to do is to generate a design matrix. Table 6.5 shows a design matrix for the data shown in Table 6.4. In this matrix, the column labelled by id was for the names of the replicates. The column labelled by dex was for specifying the two experimental conditions, namely control and treated.
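A design matrix with the columns id and dex might be built in R along the following lines. This is only a sketch under stated assumptions: the sample names `sample1`..`sample8` and the alternating control/treated assignment are placeholders (Table 6.5 gives the actual ones), and `find_degs` is a hypothetical wrapper name. The DESeq2 calls it contains (`DESeqDataSetFromMatrix`, `DESeq`, `results`) are the package's standard entry points, but the wrapper is not necessarily the workflow used in this book.

```r
# Design matrix with one row per replicate.
# Sample names and the control/treated pattern are placeholders.
design <- data.frame(
  id  = paste0("sample", 1:8),
  dex = factor(rep(c("control", "treated"), times = 4),
               levels = c("control", "treated"))  # control is the reference level
)
rownames(design) <- design$id

# Hypothetical wrapper: fit DESeq2 and return genes with adjusted p < alpha.
# `counts` is an integer matrix (genes x replicates) whose column names
# match the row names of `design`.
find_degs <- function(counts, design, alpha = 0.05) {
  if (!requireNamespace("DESeq2", quietly = TRUE)) {
    stop("The Bioconductor package DESeq2 is required")
  }
  stopifnot(identical(colnames(counts), rownames(design)))
  dds <- DESeq2::DESeqDataSetFromMatrix(countData = counts,
                                        colData   = design,
                                        design    = ~ dex)
  dds <- DESeq2::DESeq(dds)                    # size factors, dispersions, tests
  res <- DESeq2::results(dds, alpha = alpha)   # treated versus control
  res[which(res$padj < alpha), ]
}
```

The wrapper is intended for a full count matrix, for example `degs <- find_degs(counts, design)`, rather than for a handful of rows, because the dispersion estimation relies on information shared across many genes.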

By comparing the column id and the column dex, all samples in